
concept extractor

This script is experimental. It uses keyword-miner to extract the most frequently used keywords from a website, and concept-net to relate those keywords to a concept/keyword you specify.

The result is a list of keywords, each with a weight that roughly represents the word's similarity to the specified concept.
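As a rough sketch of working with such a list, the snippet below ranks entries by weight and drops those with a weight of 0. It assumes each entry is an object with `word` and `weight` fields, as in the example from test.js; the sample values are copied from the computer_science output in this README, and `rankByWeight` is a hypothetical helper, not part of concept-extractor.

```javascript
// Sample entries copied from the computer_science output in this README.
var results = [
  { word: 'software', weight: 0.8768781818186884 },
  { word: 'mining', weight: 0 },
  { word: 'data', weight: 0.549747883288543 },
  { word: 'machine', weight: 0.3341132075367523 }
];

// Hypothetical helper (not part of concept-extractor):
// keeps entries with a positive weight, sorted from highest to lowest.
function rankByWeight(entries) {
  return entries
    .filter(function (entry) { return entry.weight > 0; }) // filter() returns a new array
    .sort(function (a, b) { return b.weight - a.weight; });
}

var ranked = rankByWeight(results);

console.log(ranked.map(function (entry) { return entry.word; }));
```

Because `filter` returns a new array, the original `results` array is left untouched by the sort.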

    # install
    $ npm install mateogianolio/concept-extractor
    $ npm test

This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from http://conceptnet5.media.mit.edu. The included data was created by contributors to Commonsense Computing projects, contributors to Wikimedia projects, Games with a Purpose, Princeton University's WordNet, DBPedia, OpenCyc, and Umbel.

Example (from test.js)

    var extract = require('concept-extractor');

    // the options object is the same format as in keyword-miner
    var options = {
      site: 'https://en.wikipedia.org/wiki/Data_mining',
      limit: 10,
      element: 'body' // only extract keywords from the "body" element
    };

    function done(error, results) {
      if (error)
        throw error;

      // results contains an array of objects like this:
      // { word: 'software', weight: 0.8768781818186884 }
      var weights = results.map(function (result) {
        return result.weight;
      });

      console.log('results:');
      console.log(results);
      console.log('---');
      console.log('maximum weight:', Math.max.apply(null, weights));
      console.log();
    }

    extract(options, 'computer_science', done);

Output for the concept computer_science:

    results:
    [ { word: 'software', weight: 0.8768781818186884 },
      { word: 'mining', weight: 0 },
      { word: 'data', weight: 0.549747883288543 },
      { word: 'machine', weight: 0.3341132075367523 },
      { word: 'learning', weight: 0 },
      { word: 'conference', weight: 0 },
      { word: 'knowledge', weight: 0.15239643485924595 },
      { word: 'information', weight: 0.2516386427849485 },
      { word: 'discovery', weight: 0.08791355407164494 },
      { word: 'analysis', weight: 0.3324896054503274 } ]
    ---
    maximum weight: 0.8768781818186884

Output for the concept mathematics:

    results:
    [ { word: 'mining', weight: 0 },
      { word: 'learning', weight: 0 },
      { word: 'analysis', weight: 0.5252067463151946 },
      { word: 'conference', weight: 0.022289074816239547 },
      { word: 'software', weight: 0.16668583972978826 },
      { word: 'knowledge', weight: 0.1031308909706193 },
      { word: 'machine', weight: 0.022998221696578863 },
      { word: 'discovery', weight: 0.10574254835077147 },
      { word: 'data', weight: 0.03327238213471822 },
      { word: 'information', weight: 0 } ]
    ---
    maximum weight: 0.5252067463151946

Output for the concept economics:

    results:
    [ { word: 'learning', weight: 0 },
      { word: 'analysis', weight: 0.2115613784875437 },
      { word: 'conference', weight: 0 },
      { word: 'data', weight: 0.026387914962943407 },
      { word: 'machine', weight: 0.049241795454380406 },
      { word: 'mining', weight: 0 },
      { word: 'discovery', weight: 0.056693457235950984 },
      { word: 'software', weight: 0.10938039014364835 },
      { word: 'knowledge', weight: 0.059521765646348854 },
      { word: 'information', weight: 0 } ]
    ---
    maximum weight: 0.2115613784875437

Output for the concept physics:

    results:
    [ { word: 'learning', weight: 0 },
      { word: 'software', weight: 0 },
      { word: 'mining', weight: 0 },
      { word: 'analysis', weight: 0 },
      { word: 'machine', weight: 0 },
      { word: 'discovery', weight: 0 },
      { word: 'information', weight: 0 },
      { word: 'knowledge', weight: 0 },
      { word: 'conference', weight: 0 },
      { word: 'data', weight: 0 } ]
    ---
    maximum weight: 0

I believe that, with some improvements, this could prove really useful for text categorization/classification.
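One naive way this might be used for categorization is to run the extraction once per candidate concept and pick the concept with the highest maximum weight. The sketch below only illustrates that idea: the numbers are the maximum weights copied from the four outputs above, `bestConcept` is a hypothetical helper that is not part of concept-extractor, and using the maximum weight as the score is an assumption rather than a tested method.

```javascript
// Maximum weights copied from the four outputs above (one run per concept).
var maxWeights = {
  computer_science: 0.8768781818186884,
  mathematics: 0.5252067463151946,
  economics: 0.2115613784875437,
  physics: 0
};

// Hypothetical helper (not part of concept-extractor):
// returns the concept with the highest positive score,
// or null when no concept scored above 0.
function bestConcept(scores) {
  var best = null;

  Object.keys(scores).forEach(function (concept) {
    var score = scores[concept];
    if (score > 0 && (best === null || score > scores[best])) {
      best = concept;
    }
  });

  return best;
}

console.log('best concept:', bestConcept(maxWeights));
```

With these numbers the page would be labelled computer_science. A concept such as physics, where every weight is 0, can never be chosen, and if all concepts score 0 the helper returns null instead of guessing.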

Contribute

This was pretty much hacked together in a day so it could probably be improved in a hundred different ways. If you know one, don't hesitate to send a PR or submit an issue.
